Intro

This notebook shows a complete example of a simple step-by-step analytical process for a sample dataset ("adult census").

Its main goal is to present how a report should be structured. Of course, the analysis process details can depend on a specific problem, but the general structure should be similar to the one presented below.

Preparation

In [1]:
import pandas as pd
import numpy as np
import scipy.stats as sts

from pandas_profiling import ProfileReport

from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import cross_val_score, KFold, train_test_split, GridSearchCV
from sklearn.metrics import classification_report, accuracy_score, f1_score
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder

Load data

In [2]:
data= pd.read_csv("adult.csv")
In [3]:
data.head(3)
Out[3]:
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country earnings
0 39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States <=50K
1 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States <=50K
2 38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K
In [4]:
data.dtypes
Out[4]:
age                int64
workclass         object
fnlwgt             int64
education         object
education-num      int64
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
earnings          object
dtype: object

EDA

An Exploratory Data Analysis part should happen here. It will depend on data types and research problem. An automated exploration library (Pandas Profiling) was used to save time and space in this example.

Typically, EDA should be extensively commented - both from a technical perspective as well as business-related one.

In [5]:
profile = ProfileReport(data, title='Dataset Report', explorative=True)
In [7]:
profile.to_notebook_iframe()

Prepocess data

Data preparation steps happen here - all necessary operations to make the data usable in a further modeling process.

Encode y

In [11]:
X, y = data.drop("earnings", axis=1), data.earnings
In [12]:
y
Out[12]:
0        <=50K
1        <=50K
2        <=50K
3        <=50K
4        <=50K
         ...  
32556    <=50K
32557     >50K
32558    <=50K
32559    <=50K
32560     >50K
Name: earnings, Length: 32561, dtype: object
In [13]:
y.unique()
Out[13]:
array(['<=50K', '>50K'], dtype=object)
In [14]:
y = LabelEncoder().fit_transform(y)
In [15]:
y
Out[15]:
array([0, 0, 0, ..., 0, 0, 1])
In [16]:
X_enc = pd.get_dummies(X)
In [17]:
X_enc
Out[17]:
age fnlwgt education-num capital-gain capital-loss hours-per-week workclass_? workclass_Federal-gov workclass_Local-gov workclass_Never-worked ... native-country_Portugal native-country_Puerto-Rico native-country_Scotland native-country_South native-country_Taiwan native-country_Thailand native-country_TrinadadTobago native-country_United-States native-country_Vietnam native-country_Yugoslavia
0 39 77516 13 2174 0 40 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
1 50 83311 13 0 0 13 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
2 38 215646 9 0 0 40 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
3 53 234721 7 0 0 40 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
4 28 338409 13 0 0 40 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
32556 27 257302 12 0 0 38 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
32557 40 154374 9 0 0 40 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
32558 58 151910 9 0 0 40 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
32559 22 201490 9 0 0 20 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
32560 52 287927 9 15024 0 40 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0

32561 rows × 108 columns

Train test split

Typically, test data is saved for final evaluation. All algorithms tuning and parameters search happen on train data (which can be divided again into train-validation).

In [19]:
X_train_val, X_test, y_train_val, y_test = train_test_split(X_enc, y, test_size=0.2, random_state=123) 
In [20]:
X_train_val.head(3)
Out[20]:
age fnlwgt education-num capital-gain capital-loss hours-per-week workclass_? workclass_Federal-gov workclass_Local-gov workclass_Never-worked ... native-country_Portugal native-country_Puerto-Rico native-country_Scotland native-country_South native-country_Taiwan native-country_Thailand native-country_TrinadadTobago native-country_United-States native-country_Vietnam native-country_Yugoslavia
17064 19 219300 10 0 0 25 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
18434 58 116901 9 0 0 25 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
3294 43 220589 9 0 0 40 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0

3 rows × 108 columns

In [21]:
X_test.head(3)
Out[21]:
age fnlwgt education-num capital-gain capital-loss hours-per-week workclass_? workclass_Federal-gov workclass_Local-gov workclass_Never-worked ... native-country_Portugal native-country_Puerto-Rico native-country_Scotland native-country_South native-country_Taiwan native-country_Thailand native-country_TrinadadTobago native-country_United-States native-country_Vietnam native-country_Yugoslavia
20713 55 199713 13 0 0 15 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
13495 65 115890 13 0 0 20 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
12367 29 145592 9 0 0 40 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

3 rows × 108 columns

Classification prep

Usually, multiple algorithms are tested. A good practice is to pick one from each promising family (trivial algorithms, tree-based models, ensembles, etc).

In [22]:
algorithms = {
    'knn': KNeighborsClassifier(),
    'dt': DecisionTreeClassifier(),
    'rf': RandomForestClassifier()
}
In [23]:
kfold = KFold(n_splits=15, random_state=456)
/Users/filipwojcik/opt/anaconda3/lib/python3.8/site-packages/sklearn/model_selection/_split.py:293: FutureWarning: Setting a random_state has no effect since shuffle is False. This will raise an error in 0.24. You should leave random_state to its default (None), or set shuffle=True.
  warnings.warn(

Grid search for decision tree

It would be best if you did this for every single classifier in your set. It can take some time to complete the job, so the example here is presented only for decision trees.

In [24]:
tree_params_grid = {
    'max_depth': [3, 5, 10, 20],
    'criterion': ['gini', 'entropy'],
    'max_features': [None, 5, 10, 20]
}
grid_tree = GridSearchCV(DecisionTreeClassifier(), tree_params_grid, cv=kfold, n_jobs=3)
grid_tree_results = grid_tree.fit(X_train_val, y_train_val)
In [25]:
# check which estimator is the best?
grid_tree_results.best_estimator_
Out[25]:
DecisionTreeClassifier(criterion='entropy', max_depth=10)

Save the best estimator as your algorithm of choice to compare with the others

In [26]:
algorithms['dt'] = grid_tree_results.best_estimator_

Run cross validation to compare classifiers

If only you have sufficient data volume - perform cross-validation. Judging algorithms' performance using a single trail is misleading. It should always be verified multiple times to avoid false discoveries.

In [27]:
results = {}
for algo_name, algo in algorithms.items():
    algo_results = cross_val_score(algo, X_train_val, y_train_val, cv=kfold, n_jobs=4)
    results['model_' + algo_name] = algo_results

Compare results

Once results from multiple algorithms have been collected - an analyst should compare them using statistical tools. Typically - statistical tests for the difference of means are performed. Judging if one algorithm outperforms the other by eye is not the best idea for scientific research.

Prepare data

In [28]:
results_df = pd.DataFrame.from_dict(results)
In [29]:
results_df.mean(axis=0)
Out[29]:
model_knn    0.772305
model_dt     0.854730
model_rf     0.854499
dtype: float64
In [30]:
results_df.std(axis=0)
Out[30]:
model_knn    0.011349
model_dt     0.008912
model_rf     0.007616
dtype: float64
In [31]:
results_df
Out[31]:
model_knn model_dt model_rf
0 0.781232 0.859528 0.860104
1 0.765688 0.853771 0.858377
2 0.775475 0.849741 0.851468
3 0.773172 0.838803 0.842257
4 0.757052 0.847438 0.848590
5 0.782383 0.875072 0.868164
6 0.776051 0.850317 0.855498
7 0.770294 0.865285 0.862982
8 0.772465 0.856567 0.859447
9 0.776498 0.853687 0.849654
10 0.763249 0.847350 0.841014
11 0.764401 0.845622 0.847926
12 0.770737 0.861175 0.858295
13 0.754608 0.857143 0.859447
14 0.801267 0.859447 0.854263

Statistical tests

Friedman chi-square test to check if IN GENERAL there are differences between classifiers

In [32]:
sts.friedmanchisquare(results_df.model_knn, results_df.model_rf, results_df.model_dt)
Out[32]:
FriedmanchisquareResult(statistic=22.80000000000001, pvalue=1.1195484842590875e-05)

We can see that differences are important. Now we should perform POST HOC tests to see which classifiers are different

More details about post-hoc tests can be found here

Method 1: statsmodels library

First option will be to use statsmodels library, that can run post-hoc tests of multiple comparisons. Documentation can be found here.

WARNING - this library is unstable and changes often. This tutorial might not be actual after couple of days.

Multiple comparisons procedure requires dataframe in a form:

Method Score
Classifier1 score1
Classifier1 score2
In [33]:
from statsmodels.sandbox.stats.multicomp import MultiComparison
In [34]:
results_df2 = results_df.copy()
results_df2['id'] = results_df2.index
results_reformatted = pd.melt(results_df2, id_vars='id').drop('id', axis=1)
In [35]:
results_reformatted
Out[35]:
variable value
0 model_knn 0.781232
1 model_knn 0.765688
2 model_knn 0.775475
3 model_knn 0.773172
4 model_knn 0.757052
5 model_knn 0.782383
6 model_knn 0.776051
7 model_knn 0.770294
8 model_knn 0.772465
9 model_knn 0.776498
10 model_knn 0.763249
11 model_knn 0.764401
12 model_knn 0.770737
13 model_knn 0.754608
14 model_knn 0.801267
15 model_dt 0.859528
16 model_dt 0.853771
17 model_dt 0.849741
18 model_dt 0.838803
19 model_dt 0.847438
20 model_dt 0.875072
21 model_dt 0.850317
22 model_dt 0.865285
23 model_dt 0.856567
24 model_dt 0.853687
25 model_dt 0.847350
26 model_dt 0.845622
27 model_dt 0.861175
28 model_dt 0.857143
29 model_dt 0.859447
30 model_rf 0.860104
31 model_rf 0.858377
32 model_rf 0.851468
33 model_rf 0.842257
34 model_rf 0.848590
35 model_rf 0.868164
36 model_rf 0.855498
37 model_rf 0.862982
38 model_rf 0.859447
39 model_rf 0.849654
40 model_rf 0.841014
41 model_rf 0.847926
42 model_rf 0.858295
43 model_rf 0.859447
44 model_rf 0.854263
In [36]:
multicomp = MultiComparison(
    data=results_reformatted.value,
    groups=results_reformatted.variable,
)
print(multicomp.tukeyhsd())
   Multiple Comparison of Means - Tukey HSD, FWER=0.05   
=========================================================
  group1    group2  meandiff p-adj  lower   upper  reject
---------------------------------------------------------
 model_dt model_knn  -0.0824 0.001 -0.0908 -0.0741   True
 model_dt  model_rf  -0.0002   0.9 -0.0086  0.0081  False
model_knn  model_rf   0.0822 0.001  0.0738  0.0906   True
---------------------------------------------------------

We can see, that all differences are important EXCEPT random forest vs tree - there's no significant difference between them (remember: we have tuned Decision Tree as much as we can)

Method 2: Scipy-posthocs library

You can use scipy-posthocs library.

This library uses the same format as statsmodels above.

We will use Tukey pos-hoc test for Friedman test (multiple models).

WARNING this library is also experimental and can change rapidly. Some of those

In [37]:
import scikit_posthocs as sp
In [38]:
pvals = sp.posthoc_tukey(results_reformatted, val_col='value', group_col='variable', )
pvals < 0.05
Out[38]:
model_knn model_dt model_rf
model_knn False True True
model_dt True False False
model_rf True False False
In [39]:
pvals
Out[39]:
model_knn model_dt model_rf
model_knn 1.000 0.001 0.001
model_dt 0.001 1.000 0.900
model_rf 0.001 0.900 1.000

Conclusions

Once the analysis is done, and the performance of different methods has been compared - one should prepare a final chapter presenting the following aspects:

  • Introduction - what was the research question, hypothesis, and the goal of the study;
  • Methods - which methods were proven to be the best;
  • Results - what are the final results;
  • Discussion - possible future extensions to the method, as well as study limitations.